An investigation of the performance portability of OpenCL

Authors

  • Simon J. Pennycook
  • Simon D. Hammond
  • Steven A. Wright
  • J. A. Herdman
  • I. Miller
  • Stephen A. Jarvis
Abstract

This paper reports on the development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite. An account of the design decisions addressed during the development of this code is presented, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types. The resulting platform-agnostic, single source application is benchmarked on a number of different architectures, and is shown to be 1.3–1.5x slower than native FORTRAN or CUDA implementations on a single node and 1.3–3.1x slower on multiple nodes. We also explore the potential performance gains of OpenCL’s device fissioning capability, demonstrating up to a 3x speed-up over our original OpenCL implementation.

Similar resources

Improving Performance Portability in OpenCL Programs

We study the performance portability of OpenCL across diverse architectures including NVIDIA GPU, Intel Ivy Bridge CPU, and AMD Fusion APU. We present detailed performance analysis at assembly level on three exemplar OpenCL benchmarks: SGEMM, SpMV, and FFT. We also identify a number of tuning knobs that are critical to performance portability, including threads-data mapping, data layout, tiling...

Cross-Platform OpenCL Code and Performance Portability for CPU and GPU Architectures Investigated with a Climate and Weather Physics Model

Current multi- and many-core computing typically involves multi-core Central Processing Units (CPU) and many-core Graphical Processing Units (GPU) whose architectures are distinctly different. To keep longevity of application codes, it is highly desirable to have a programming paradigm to support these current and future architectures. Open Computing Language (OpenCL) is created to address this p...

Dealing with performance/portability and performance/accuracy trade-offs in heterogeneous computing systems: A case study with matrix multiplication modulo primes

We present the study of two important trade-offs in heterogeneous systems (i.e., between performance versus portability and between performance and accuracy) for a relevant linear algebra problem, matrix multiplication modulo primes. Integer matrix linear algebra methods rely heavily on matrix multiplication modulo primes. Double precision is necessary for exact representation of sufficiently m...

Patterns and Rewrite Rules for Systematic Code Generation (From High-Level Functional Patterns to High-Performance OpenCL Code)

Computing systems have become increasingly complex with the emergence of heterogeneous hardware combining multicore CPUs and GPUs. These parallel systems exhibit tremendous computational power at the cost of increased programming effort. This results in a tension between achieving performance and code portability. Code is either tuned using device-specific optimizations to achieve maximum perfo...

Exploiting heterogeneous parallelism with the Heterogeneous Programming Library

While recognition of the advantages of heterogeneous computing is steadily growing, the issues of programmability and portability hinder its exploitation. The introduction of the OpenCL standard was a major step forward in that it provides code portability, but its interface is even more complex than that of other approaches. In this paper we present the Heterogeneous Programming Library (HPL),...

Journal title:
  • J. Parallel Distrib. Comput.

Volume 73  Issue 

Pages  -

Publication date 2013